[SPARK-19602][SQL] Support column resolution of fully qualified column name (3 part name) #17185
skambha wants to merge 7 commits into apache:master
Conversation
cc @gatorsmile, @cloud-fan I'd really appreciate your review and comments. Thanks much.
I'm not sure this is a valid case. We should treat SELECT ... FROM ... as a two-level lookup: at level 1, in the FROM clause, we specify the table name with an optional qualifier (the database name); at level 2, in the SELECT clause, we specify the column name with an optional qualifier (the table name).
I don't think it makes sense to specify the database name for a column name; the table lookup should be done in the FROM clause.
Yes, but @skambha did a comparison across different databases. Basically, what she tried to do is make Spark SQL consistent with the others. See the design doc:
https://issues.apache.org/jira/secure/attachment/12854681/Design_ColResolution_JIRA19602.pdf
Thanks @gatorsmile and @cloud-fan.
Hi @cloud-fan,
The use case is to support fully qualified column names in queries, to remove ambiguity in naming.
For example: select db2.t1.i1 from db1.t1, db2.t1 where db1.t1.i2 = db2.t1.i3
Today Spark does not support this; the workaround is to create an alias for the table.
Yes, the lookup for the table comes from the FROM clause. At a high level, the changes include keeping track of the qualifier (including the database name) when the lookup happens, and enhancing the column resolution logic in the analyzer to resolve a fully qualified column name from a logical plan node's children.
There are more details in the design doc attached to SPARK-19602.
Thanks.
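To make the use case concrete, here is today's alias workaround next to the proposed syntax. The tables and columns are taken from the example above; the alias names a and b are arbitrary:

```sql
-- Workaround today: disambiguate via table aliases
SELECT b.i1 FROM db1.t1 AS a, db2.t1 AS b WHERE a.i2 = b.i3;

-- With this patch: fully qualified column names, no aliases needed
SELECT db2.t1.i1 FROM db1.t1, db2.t1 WHERE db1.t1.i2 = db2.t1.i3;
```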
I agree it's a valid use case, do you wanna bring it up to date? Sorry for the delay!
Sure. Let me look into it. Thanks.
I think we can just use Seq[String].
I followed the style that exists today. If we change it here, there are other places that need to change as well. For example, in AttributeReference we initialize it with None today, and it would become Seq.empty. I am not sure if we want to create these objects or just leave it as None by keeping the Option[].
If you have a strong preference, I can update it. I have the changes locally but have not pushed that version. Thanks.
I have rebased and pushed the changes. I ran the unit tests (sql, catalyst and hive). Please take a look. Thanks.
ok to test
Test build #86899 has finished for PR 17185 at commit
Any plans of merging this PR?
@mafernandez-stratio We plan to review and merge after the 2.3 release. The target release will be 2.4.
@gatorsmile Great, thanks for the feedback!
ok to test
Test build #93810 has finished for PR 17185 at commit
what about SELECT db1.t1.* FROM t1 while the current database is db1?
This (SELECT db1.t1.* FROM t1) would resolve correctly.
A similar scenario (details in Section 2, Table A #5 in the design doc): select db1.t1.i1 from t1 will resolve correctly when the current database is db1. This is tested here.
I will explicitly add a test case for SELECT db1.t1.* FROM t1 as well.
I think we can make SubqueryAlias take a qualifier: Seq[String] instead of alias: String.
Here, SubqueryAlias takes an AliasIdentifier and not a String.
I also want to mention that we considered something similar to your suggestion: simply adding a qualifier as a Seq[String] member of SubqueryAlias, but that change would require touching more files. I had implemented that first (a long time back now), and after some offline discussions we favored this approach as it is more localized and encapsulated. @gatorsmile @cloud-fan
Again, I feel using Seq[String] directly can simplify a lot of code.
I'll look into this.
I took care of this change in the recent push. Thanks.
can you briefly explain the new resolution logic? I feel it's a little convoluted now, as we are more likely to hit ambiguity.
The resolution logic needs to compare against the full qualifier. If you are trying to resolve db1.t1.i1, the match needs to happen against the entire qualifier of the attribute.
For example, take select db1.t1.i1 from t1, where the current database is db1 and t1 has columns i1 and i2. When resolving db1.t1.i1, the input attributes are:
- AttributeReference (qualifier Seq(db1, t1), name i1)
- AttributeReference (qualifier Seq(db1, t1), name i2)
So the new resolution logic will match the entire qualifier sequence and then match the column (i.e. the attribute name).
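A minimal, self-contained sketch of the matching rule described above. Attr and resolver are simplified stand-ins for illustration, not the actual Spark types:

```scala
// Simplified stand-in for AttributeReference: a qualifier plus a column name.
case class Attr(qualifier: Seq[String], name: String)

// Case-insensitive resolver, as in Spark's default (case-insensitive) analysis.
val resolver: (String, String) => Boolean = _.equalsIgnoreCase(_)

// nameParts is e.g. Seq("db1", "t1", "i1"): the last element is the column
// name; the preceding parts must match a suffix of the attribute's qualifier.
def matchesAttr(nameParts: Seq[String], attr: Attr): Boolean = {
  val qualifierParts = nameParts.dropRight(1)
  resolver(nameParts.last, attr.name) &&
    qualifierParts.length <= attr.qualifier.length &&
    attr.qualifier.takeRight(qualifierParts.length)
      .corresponds(qualifierParts)(resolver)
}

matchesAttr(Seq("db1", "t1", "i1"), Attr(Seq("db1", "t1"), "i1"))  // true
matchesAttr(Seq("t1", "i1"),        Attr(Seq("db1", "t1"), "i1"))  // true
matchesAttr(Seq("i1"),              Attr(Seq("db1", "t1"), "i1"))  // true
matchesAttr(Seq("db2", "t1", "i1"), Attr(Seq("db1", "t1"), "i1"))  // false
```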
At a high level, the changes include:
- The column resolution needs to account for the qualifier, which is now a sequence of strings.
- Enhance lookupRelation to add the qualifier appropriately. Right now, the LogicalPlan leaf nodes representing a relation use only the table name as the qualifier when no alias is provided. This needs to be updated to add both the database name and the table name for permanent views, tables, and global temporary views.
Full details of the design and the resolution logic are in Sections 3.3 and 3.4 of the doc. Section 2 of the design doc has many scenarios that were tested against different databases and Spark, along with the expected behavior under this design.
If there is a specific scenario you have in mind that would be valid but gets flagged as ambiguous, please let me know. I'd be glad to look into it.
@gatorsmile, @cloud-fan, just a quick comment: I have been working on this and will respond soon.
I rebased and found that the resolution code in LogicalPlan has changed and now uses a map lookup to do the matching. I have some ideas on how to incorporate the 3 part name into the map lookup logic. For now, I have synced up, bypassed the new map logic, and pushed the code so it is up to date and I can get the test cycle from GitHub; I added TODOs for myself to incorporate the optimized lookup logic. @cloud-fan, in the meantime, if you have any other comments please let me know. Thanks.
Test build #93904 has finished for PR 17185 at commit
do we assume the attribute always carries the full qualifier? i.e. will the analyzer always include the current database in the qualifier?
The database information gets carried over based on the lookupRelation logic. Details are in Section 3.3.3 in the design doc.
it's weird to use pattern matching like this, maybe better to write:
nameParts.corresponds(qualifierList)(resolver) || {
  if (nameParts.length == 1 && qualifierList.nonEmpty) {
    resolver(nameParts.head, qualifierList.last)
  } else {
    false
  }
}
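For reference, Scala's `corresponds` requires equal lengths plus a pairwise match, which is why the extra single-name-part branch is needed. A standalone illustration, not the PR's code:

```scala
// A case-insensitive resolver, for illustration.
val resolver: (String, String) => Boolean = _.equalsIgnoreCase(_)

Seq("db1", "t1").corresponds(Seq("DB1", "T1"))(resolver)  // true: pairwise match
Seq("t1").corresponds(Seq("db1", "t1"))(resolver)         // false: lengths differ
// Hence the fallback branch, which compares a lone name part
// against the last element of the qualifier list.
```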
IIRC users can do select structCol.* from tbl; does that mean nameParts.head may not be a table name?
- a seq of 2 elements: database name and table name.
overall LGTM, my major concern is how to do O(1) lookup for the 3 part name
Thanks @cloud-fan for the review. I am working on implementing an idea to get an optimized lookup with the 3 part name.
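One plausible shape for such an optimized lookup, sketched under simplified assumptions (Attr and all names here are hypothetical, not the actual Spark code): key a map by column name so resolution is a single map hit plus a qualifier check.

```scala
// Simplified stand-in for AttributeReference.
case class Attr(qualifier: Seq[String], name: String)

val attrs = Seq(Attr(Seq("db1", "t1"), "i1"), Attr(Seq("db1", "t1"), "i2"))

// Build the index once per plan node: column name -> candidate attributes.
val byName: Map[String, Seq[Attr]] = attrs.groupBy(_.name.toLowerCase)

// Resolving "db1.t1.i1" is one hash lookup, then a suffix match on
// the qualifier (case-insensitive).
def lookup(nameParts: Seq[String]): Seq[Attr] = {
  val qualifier = nameParts.dropRight(1).map(_.toLowerCase)
  byName.getOrElse(nameParts.last.toLowerCase, Nil)
    .filter(a => a.qualifier.map(_.toLowerCase).endsWith(qualifier))
}

lookup(Seq("db1", "t1", "i1"))  // finds the i1 attribute
lookup(Seq("db2", "t1", "i1"))  // empty: qualifier mismatch
```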
retest this please
retest this please
do we need to consider if qualifier is Nil?
Sure. Will add a defensive check.
we may need to check the SQL standard for how to resolve the ambiguity. For a.b.c, it's possible that a is a db name, b is a table name, and c is a column name; if there is no column named c, we should fail. It's also possible that a is a table name, b is a column name, and c is a nested field name, and it exists. What's the expected behavior here?
In this case, if a.b.c fails to resolve as db.table.column, then we check if there is a table and column that matches a.b, and then see if c is a nested field name; if it exists, it will resolve to the nested field.
Tests with struct nested fields are here.
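A rough sketch of that fallback order under simplified assumptions (Attr and the helper names here are hypothetical, not Spark's real API): try the longest qualifier-plus-column interpretation first, then peel trailing parts off as nested struct field accessors.

```scala
// Simplified stand-in for AttributeReference.
case class Attr(qualifier: Seq[String], name: String)

// Returns the matched attribute and any leftover parts, which are
// then treated as nested struct field accessors.
def resolve(parts: Seq[String], attrs: Seq[Attr]): Option[(Attr, Seq[String])] = {
  (parts.length to 1 by -1).view.flatMap { i =>
    val (name, nestedFields) = parts.splitAt(i)
    attrs.find { a =>
      a.name.equalsIgnoreCase(name.last) &&
        a.qualifier.map(_.toLowerCase).endsWith(name.dropRight(1).map(_.toLowerCase))
    }.map(a => (a, nestedFields))
  }.headOption
}

// a.b.c: no column c under qualifier a.b, so fall back to column a.b
// with c interpreted as a nested field.
val attrs = Seq(Attr(Seq("db1", "a"), "b"))
resolve(Seq("a", "b", "c"), attrs)  // Some((Attr(Seq("db1","a"),"b"), Seq("c")))
```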
Is this behavior defined by the SQL standard? IIRC the current behavior without your patch is that col.innerField1.innerField2 can fail even if we have a table named col and a column named innerField1 which has an innerField2 field.
Actually it won't fail without this patch either: col.innerField1.innerField2 will resolve correctly to the struct field if there is a table named col and a column named innerField1 which has an innerField2 field.
Please see an example of such a scenario in this test output on master Spark:
https://github.com/apache/spark/blob/master/sql/core/src/test/resources/sql-tests/results/columnresolution.sql.out#L360
Let me look into the SQL standard and get back on that.
What I'm talking about is ambiguity. col.innerField1.innerField2 can fail if a table has a col column, and the col column has an innerField1 field, but innerField2 doesn't exist. My question is: shall we try all the possible resolution paths and pick the valid one, or define a rule that decides the resolution path ahead of time?
@cloud-fan, thank you for your suggestion and question.
Existing Spark behavior follows precedence rules in the column resolution logic, and in this patch we follow the same pattern/rule.
I am looking into the SQL standard to see if there are any column resolution rules, but I have not found any yet. However, when I researched existing databases, I observed different behaviors among them; these are listed in Section 2/Table A in the design doc.
I agree we can improve upon the checks in the existing precedence to go all the way and ensure there is a nested field, although the user can always qualify the field to resolve the ambiguity. Shall we open another issue to discuss and improve the existing resolution logic?
instead of creating AliasIdentifier manually, we can also define a SubqueryAlias.apply which takes a db name and table name.
Sure, let me do that if this style is preferred. Thanks.
Test build #94206 has finished for PR 17185 at commit
Changes to resolve fully qualified column attribute references from a logical plan node's children
…rt name and remove my todos
I have addressed the review comments in this commit here. @cloud-fan, please take a look. Thanks.
Test build #94234 has finished for PR 17185 at commit
retest this please
Test build #94244 has finished for PR 17185 at commit
override def sql: String = {
-  val qualifierPrefix = qualifier.map(_ + ".").getOrElse("")
+  val qualifierPrefix = if (qualifier.nonEmpty) qualifier.mkString(".") + "." else ""
nit: val qualifierPrefix = qualifier.mkString("", ".", ".")
This won't work when we have Seq.empty: the suffix "." gets returned even for an empty sequence. For a non-empty Seq, the above call is fine. Shall we leave the 'if' as is, or is there an equivalent preferred style that would work?
ah my bad, I thought it would return an empty string for an empty seq. Let's leave it.
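The behavior being discussed, verifiable in a Scala REPL: mkString(start, sep, end) always emits start and end, even for an empty collection.

```scala
val q = Seq("db1", "t1")
q.mkString("", ".", ".")                  // "db1.t1."
Seq.empty[String].mkString("", ".", ".")  // "." -- start + end survive the empty case

// hence the explicit guard in the patch:
val prefix = if (q.nonEmpty) q.mkString(".") + "." else ""
```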
val key = (qualifier.toLowerCase(Locale.ROOT), name.toLowerCase(Locale.ROOT))
val attributes = collectMatches(name, qualified.get(key)).filter { a =>
  resolver(qualifier, a.qualifier.get)
matches = matches match {
nit:
if (matches._1.isEmpty) {
  matches = ...
}
def apply(
  identifier: String,
  database: Option[String],
why is it an Option? if there is no database name, we should call the apply method above
good point! I'll take care of this in the next push.
LGTM except some code style comments, thanks for working on it!
Thanks for the review. I have addressed your comments and pushed the changes.
Test build #94335 has finished for PR 17185 at commit
retest this please
Test build #94353 has finished for PR 17185 at commit
thanks, merging to master!
@skambha @dilipbiswal How about the hint resolution after supporting multi part names?
Let us discuss it in the JIRA: https://issues.apache.org/jira/browse/SPARK-25121
Good idea @gatorsmile. Thanks!
What changes were proposed in this pull request?
The design details are attached to the JIRA issue here.
High-level overview of the changes:
How was this patch tested?